DNA Hash Pooling and its Applications
In this paper we describe a new technique for the comparison of populations
of DNA strands. Comparison is vital to the study of ecological systems, at both
the micro and macro scales. Existing methods make use of DNA sequencing and
cloning, which can prove costly and time consuming, even with current
sequencing techniques. Our overall objective is to address questions such as:
(i) (Genome detection) Is a known genome sequence present, at least in part, in
an environmental sample? (ii) (Sequence query) Is a specific fragment sequence
present in a sample? (iii) (Similarity discovery) How similar in terms of
sequence content are two unsequenced samples? We propose a method involving
multiple filtering criteria that result in "pools" of DNA of high or very high
purity. Because our method is similar in spirit to hashing in computer science,
we call it DNA hash pooling. To illustrate this method, we describe protocols
using pairs of restriction enzymes. The in silico empirical results we present
reflect a sensitivity to experimental error. Our method will normally be
performed as a filtering step prior to sequencing in order to reduce the amount
of sequencing required (generally by a factor of 10 or more). Even as
sequencing becomes cheaper, an order of magnitude remains important.
Comment: 14 pages, 3 figures. To appear in the International Journal of
Nanotechnology and Molecular Computation. Improved background, analysis and
references.
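The hashing analogy can be made concrete with a toy sketch: strands are binned by a cheap observable signature, here the fragment lengths produced by cutting at two recognition sites. The sites, the simplified cutting model, and all function names below are illustrative assumptions, not the paper's actual protocol.

```python
# Toy illustration of hash pooling: bin DNA strands by the fragment-length
# signature of two (hypothetical) restriction sites, much as a hash function
# bins keys into buckets. Real enzymes cut within their site; here the cut
# simply consumes the site, a deliberate simplification.

def digest(seq, site):
    """Cut seq at every occurrence of a recognition site; return fragment lengths."""
    frags, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        frags.append(pos - start)
        start = pos + len(site)   # simplification: the cut consumes the site
        pos = seq.find(site, start)
    frags.append(len(seq) - start)
    return frags

def hash_pool(strands, site_a="GAATTC", site_b="GGATCC"):
    """Pool strands sharing the same (site_a, site_b) fragment-length signature."""
    pools = {}
    for s in strands:
        key = (tuple(digest(s, site_a)), tuple(digest(s, site_b)))
        pools.setdefault(key, []).append(s)
    return pools
```

Strands with identical signatures land in the same pool, so only one representative per pool needs sequencing, which is where the order-of-magnitude reduction comes from.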
Fast parallel algorithms for the unit cost editing distance between trees
1. Problem
Ordered labeled trees are trees whose nodes are labeled and in which the left-to-right order among siblings is significant. We consider the distance between two trees to be the minimum number of edit operations (insert, delete, and modify) necessary to transform one tree into another. We present three algorithms to find the distance. The first algorithm is a simple dynamic programming algorithm based on a postorder traversal whose complexity improves upon the best previously published algorithm, due to Tai (JACM, 1979). The second and third algorithms are parallel algorithms based on the application of suffix trees to the comparison problem. The cost of executing these algorithms is a monotonically increasing function of the distance between the two trees.
2. Results
Let trees T1 and T2 have L1 and L2 levels respectively. Let k be the actual distance between T1 and T2, and let N be min(|T1|, |T2|). The asymptotic running times (assuming a concurrent-read concurrent-write parallel random access machine) are:
Algorithm   Time                         Processors
Tai         |T1| × |T2| × L1² × L2²
Alg 1       |T1| × |T2| × L1 × L2        …
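The sequential dynamic program the abstract describes can be sketched as follows. This is a unit-cost postorder formulation (in the style later popularized by Zhang and Shasha), not the paper's parallel suffix-tree algorithms; trees are (label, children) tuples and all names are illustrative.

```python
# Ordered-tree edit distance with unit-cost insert, delete, and modify,
# computed by dynamic programming over postorder intervals.

def annotate(tree):
    """Postorder-number nodes; return labels, leftmost-leaf indices, keyroots."""
    labels, lml = [], []
    def walk(node):
        label, children = node
        first = None
        for child in children:
            idx = walk(child)
            if first is None:
                first = lml[idx]
        me = len(labels)
        labels.append(label)
        lml.append(first if first is not None else me)
        return me
    walk(tree)
    n = len(labels)
    # keyroots: nodes with no later node sharing the same leftmost leaf
    keyroots = [i for i in range(n)
                if all(lml[j] != lml[i] for j in range(i + 1, n))]
    return labels, lml, keyroots

def tree_edit_distance(t1, t2):
    """Minimum number of unit-cost edits transforming t1 into t2."""
    lab1, l1, kr1 = annotate(t1)
    lab2, l2, kr2 = annotate(t2)
    td = [[0] * len(lab2) for _ in lab1]      # subtree-pair distances
    for i in kr1:
        for j in kr2:
            m, n = i - l1[i] + 2, j - l2[j] + 2
            fd = [[0] * n for _ in range(m)]  # forest distances
            ioff, joff = l1[i] - 1, l2[j] - 1
            for x in range(1, m):
                fd[x][0] = fd[x - 1][0] + 1   # delete
            for y in range(1, n):
                fd[0][y] = fd[0][y - 1] + 1   # insert
            for x in range(1, m):
                for y in range(1, n):
                    if l1[x + ioff] == l1[i] and l2[y + joff] == l2[j]:
                        cost = 0 if lab1[x + ioff] == lab2[y + joff] else 1
                        fd[x][y] = min(fd[x - 1][y] + 1, fd[x][y - 1] + 1,
                                       fd[x - 1][y - 1] + cost)  # modify
                        td[x + ioff][y + joff] = fd[x][y]
                    else:
                        p = l1[x + ioff] - 1 - ioff
                        q = l2[y + joff] - 1 - joff
                        fd[x][y] = min(fd[x - 1][y] + 1, fd[x][y - 1] + 1,
                                       fd[p][q] + td[x + ioff][y + joff])
    return td[-1][-1]
```

For example, deleting a single leaf yields distance 1: `tree_edit_distance(('a', [('b', []), ('c', [])]), ('a', [('b', [])]))` returns 1.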
A Collaborative Approach to Computational Reproducibility
Although a standard in natural science, reproducibility has been only
episodically applied in experimental computer science. Scientific papers often
present a large number of tables, plots and pictures that summarize the
obtained results, but then loosely describe the steps taken to derive them. Not
only can the methods and the implementation be complex, but also their
configuration may require setting many parameters and/or depend on particular
system configurations. While many researchers recognize the importance of
reproducibility, the challenges of making it happen often outweigh the benefits.
Fortunately, a plethora of reproducibility solutions have been recently
designed and implemented by the community. In particular, packaging tools
(e.g., ReproZip) and virtualization tools (e.g., Docker) are promising
solutions towards facilitating reproducibility for both authors and reviewers.
To address the incentive problem, we have implemented a new publication model
for the Reproducibility Section of Information Systems Journal. In this
section, authors submit a reproducibility paper that explains in detail the
computational assets from a previously published manuscript in Information
Systems.
Debugging Machine Learning Pipelines
Machine learning tasks entail the use of complex computational pipelines to
reach quantitative and qualitative conclusions. If some of the activities in a
pipeline produce erroneous or uninformative outputs, the pipeline may fail or
produce incorrect results. Inferring the root cause of failures and unexpected
behavior is challenging, usually requiring much human thought, and is both
time-consuming and error-prone. We propose a new approach that makes use of
iteration and provenance to automatically infer the root causes and derive
succinct explanations of failures. Through a detailed experimental evaluation,
we assess the cost, precision, and recall of our approach compared to the state
of the art. Our source code and experimental data will be available for
reproducibility and enhancement.
Comment: 10 pages.
Constellation Queries over Big Data
A geometrical pattern is a set of points with all pairwise distances (or,
more generally, relative distances) specified. Finding matches to such patterns
has applications to spatial data in seismic, astronomical, and transportation
contexts. For example, a particularly interesting geometric pattern in
astronomy is the Einstein cross, which is an astronomical phenomenon in which a
single quasar is observed as four distinct sky objects (due to gravitational
lensing) when observed by Earth-based telescopes. Finding such crosses, as well as
other geometric patterns, is a challenging problem as the potential number of
sets of elements that compose shapes is exponentially large in the size of the
dataset and the pattern. In this paper, we denote geometric patterns as
constellation queries and propose algorithms to find them in large data
applications. Our methods combine quadtrees, matrix multiplication, and
unindexed join processing to discover sets of points that match a geometric
pattern within some additive factor on the pairwise distances. Our distributed
experiments show that the choice of composition algorithm (matrix
multiplication or nested loops) depends on the freedom introduced in the query
geometry through the distance additive factor. Three clearly identified blocks
of threshold values guide the choice of the best composition algorithm.
Finally, solving the problem for relative distances requires a novel
continuous-to-discrete transformation. To the best of our knowledge, this
paper is the first to investigate constellation queries at scale.
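The matching condition itself (every pairwise distance equal to the pattern's within an additive factor) can be sketched with a brute-force nested-loop version. The quadtree and matrix-multiplication machinery the paper uses for scale is omitted, and all names here are illustrative.

```python
# Brute-force constellation matching: a candidate point set matches the query
# pattern if, under some ordering, all pairwise distances agree with the
# pattern's within an additive tolerance eps.
from itertools import permutations, combinations
from math import dist

def matches(candidate, pattern, eps):
    """True if some ordering of candidate matches pattern's pairwise distances."""
    k = len(pattern)
    pat = [[dist(pattern[i], pattern[j]) for j in range(k)] for i in range(k)]
    for perm in permutations(candidate):
        if all(abs(dist(perm[i], perm[j]) - pat[i][j]) <= eps
               for i in range(k) for j in range(i + 1, k)):
            return True
    return False

def constellation_query(points, pattern, eps):
    """All k-subsets of points matching the pattern within additive eps."""
    return [c for c in combinations(points, len(pattern))
            if matches(c, pattern, eps)]
```

The exponential blow-up the abstract mentions is visible here: `combinations` enumerates all k-subsets, which is exactly what the quadtree pruning and composition algorithms avoid.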
SafePredict: A Machine Learning Meta-Algorithm That Uses Refusals to Guarantee Correctness